Skip to content

Fix duplicated IDs in ALTO XML when multiple pages are present #4386

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jankal
Copy link

@jankal jankal commented Jan 25, 2025

When running tesseract on a list of images and validating the alto output using the schema (https://www.loc.gov/alto/v3/alto-3-0.xsd), I had the following error:

 Element '{http://www.loc.gov/standards/alto/ns-v3#}ComposedBlock', attribute 'ID': 'cblock_0' is not a valid value of the atomic type '{http://www.loc.gov/standards/alto/ns-v3#}BlockTypeID'.

Digging into the type of BlockTypeID, I found a xsd:ID restriction - meaning the BlockTypeID can only exist once in the whole document / is unique per document.

I incorporated the current page_number into the IDs thus resolving the issue.

@jankal jankal changed the title Fix alto xml duplicates IDs when multiple pages are present Fix duplicates IDs in ALTO XML when multiple pages are present Jan 25, 2025
@jankal jankal changed the title Fix duplicates IDs in ALTO XML when multiple pages are present Fix duplicated IDs in ALTO XML when multiple pages are present Jan 25, 2025
@stweil
Copy link
Member

stweil commented Jan 25, 2025

Thanks for the fix. I just wonder whether we should try to keep the current IDs for the most common use case where only a single page is processed. This could be achieved for example by

  • conditional code
  • using counters (bcnt, ...) which are not reset for each additional image
  • using an index of (page_number - 1) * 1000 + bcnt ... in the ID strings

…nt pge

This will ensure, validated ALTO XML output is generated while keeping IDs for the first page consistent as before.
@jankal
Copy link
Author

jankal commented Jan 26, 2025

Thanks for your suggestion @stweil!

I added a function GetID to generate the ID. The new GetID function keeps the current ID structure for the first page and starts to add the page_number from the second page on.

Disclosure: I haven't written CPP before, so pls check for rookie mistakes :)

@stweil stweil requested a review from Copilot April 28, 2025 07:12
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses the issue of duplicated IDs in ALTO XML output when processing multiple pages by incorporating the page number into the generated IDs.

  • Introduces a new helper function (GetID) to generate IDs based on a prefix, page number, and counter.
  • Updates ID assignments for elements such as Illustration, GraphicalElement, ComposedBlock, TextBlock, TextLine, and String to ensure uniqueness.
Comments suppressed due to low confidence (1)

src/api/altorenderer.cpp:54

  • [nitpick] Consider renaming GetID to 'GenerateUniqueID' to more clearly reflect its role in creating unique IDs across pages.
static std::string GetID(char const * prefix, int page_number, int counter) {

@@ -51,6 +51,20 @@ static void AddBoxToAlto(const ResultIterator *it, PageIteratorLevel level,
}
}

static std::string GetID(char const * prefix, int page_number, int counter) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static std::string GetID(char const * prefix, int page_number, int counter) {
static std::string GetID(const char *prefix, int page_number, int counter) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code is technically fine, but my suggested change uses the coding style which is typical for Tesseract code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants